The dataset is a public dataset hosted in BigQuery containing New York City taxi and limousine trips collected by the NYC Taxi and Limousine Commission (TLC). It includes trip records from all trips in yellow and green taxis, and all for-hire vehicles (FHV). This analysis will focus on yellow and green taxi trip data only, excluding FHV trip data as it is found to be missing important columns such as trip distance and itemized fares.
data source: https://console.cloud.google.com/marketplace/details/city-of-new-york/nyc-tlc-trips
Data is first extracted from BigQuery public database and integrated into a separate database in a personal public project. Data from different tables are transformed into one table. Data cleaning was performed in BigQuery which applied the following conditions.
BigQuery code:
SELECT *
FROM (
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2011
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2011 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2012
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2012 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2013
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2013 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2014
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2014 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2015
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2015 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2016
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2016 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2017 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2018 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2019
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2019 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2020
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2020 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2021
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2021 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, NULL AS ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Yellow" AS taxi_type , EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2022 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Green" AS taxi_type, EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2014
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2014 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Green" AS taxi_type, EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2015
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2015 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Green" AS taxi_type, EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2016
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2016 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Green" AS taxi_type, EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2017
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2017 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Green" AS taxi_type, EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2018
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2018 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Green" AS taxi_type, EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2019
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2019 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Green" AS taxi_type, EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2020
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2020 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Green" AS taxi_type, EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2021
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2021 AND passenger_count > 0
UNION ALL
SELECT vendor_id, pickup_datetime, dropoff_datetime, trip_distance, rate_code, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, imp_surcharge, ehail_fee, airport_fee, total_amount, pickup_location_id, dropoff_location_id, "Green" AS taxi_type, EXTRACT(YEAR FROM pickup_datetime) AS pickup_year, EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour, DATE(pickup_datetime) AS pickup_date, TIME(pickup_datetime) AS pickup_time
FROM bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022
WHERE EXTRACT(YEAR FROM pickup_datetime) = 2022 AND passenger_count > 0
)
WHERE
passenger_count > 0
trip_distance > 0
AND fare_amount > 0
AND (extra >= 0 OR extra IS NULL)
AND (mta_tax >= 0 OR mta_tax IS NULL)
AND (tip_amount >= 0 OR tip_amount IS NULL)
AND (tolls_amount >= 0 OR tolls_amount IS NULL)
AND (imp_surcharge >= 0 OR imp_surcharge IS NULL)
AND (airport_fee >= 0 OR airport_fee IS NULL)
AND dropoff_datetime > pickup_datetime
AND trip_distance <= 100
AND total_amount <= 350
The cleaned data is loaded in Power BI importing from BigQuery through DirectQuery. As the data source is very huge, it is impossible to import without using DirectQuery. Power BI would not be able to handle this amount data. DirectQuery allows connection directly to a data source in its original source repository such as BigQuery, effectively handling very huge datasets. With these, there are limitations in using DirectQuery such as limited extensive data transformations. This is why all data cleaning and querying is done beforehand using BigQuery SQL.
A separate date table is beneficial in order to seamlessly create visualizations that make use of date and time fields such as year, month, day of the week, and hour of the day. The date table is created using Power Query M Code. It includes the date, year, month number, short month format, short day format, day of week number, quarter, and year-quarter.
The hour table is created separately for simplicity. It consists of hour number (0-23), time, and the desired time format including only the hour. This table is beneficial in determining the busiest hour of the day.
The main data table from BigQuery contains all the data of each trip. There are three other lookup tables date_table, hour_table, and nyc_zone table. The date and hour tables, as stated previously, is used for visualizations that have date and time fields while the nyc_zone table is necessary for the map visualization to work properly. The date_table is connected to the main table through a many-to-one relationship to pickup_date column. The pickup_hour column in main table is connected to the Hour Number column in the hour table with a many-to-one relationship. Similarly, a many-to-one relationship is established between pickup/dropoff location id in the main table to the locationid column in the nyc_zones table.
This interactive dashboard in Power Bi offers a comprehensive visual representation of the city's dynamic taxi industry. Through interactive charts and maps, this dashboard provides a deep dive into the patterns, trends, and behaviors of taxi trips across different zones and timeframes. By leveraging data analysis and visualization, valuable insights into passenger preferences, peak hours, trip distances, and tipping behaviors is conveyed.
Over the years, there has been a noticeable decrease in the number of taxi trips. This decline could potentially be influenced by the rise of ride-sharing platforms like Uber, offering convenient hailing and payment methods.
The year 2020 marked the lowest point in trip counts. This was primarily due to the onset of the pandemic, leading to reduced tourism, remote work arrangements, and social isolation. Subsequently, starting from 2020, there appears to be a positive trend, possibly linked to the decline in COVID-19 cases and increased vaccination efforts. However, it's important to note that trip counts remain considerably lower compared to previous years.
The heatmap visualization of taxi trips count based on hours and days of the week aims to provide valuable insights into the busiest hours and days for taxi trips, highlighting potential opportunities for drivers to maximize their earnings during peak periods.
The busiest hours for taxi trips appear to be around 6-7 pm on weekdays. This could correspond to the end of the workday and the start of leisure activities or people taking taxi trips home. Additionally, a significant number of trips are observed around 8 am which might indicate the start of work hours.
There is a noticeable difference in trip density between weekdays and weekends. During weekdays, the peak hours are concentrated on the start and end of office hours. There is also a significant drop in trip density during the early morning hours (4-8 am), possibly reflecting fewer travel needs during these hours. On Fridays and Sundays, a high trip density is observed from 6 pm until the early hours of the next day. This trend may be attributed to people leaving work and engaging in activities during these hours. However, on Sundays, there is a period of lower density from 8 pm to 11 pm, possibly indicating an inclination to return home earlier in preparation for the upcoming workweek.
Friday emerges as the day with the highest number of trips, followed by Saturday and then Thursday. This suggests that these days are generally busier for taxi drivers, possibly due to increased recreational and social activities.
Overall, peak hour and day patterns appear similar for yellow and green taxis, with Friday and Saturday nights being more prominent in green taxis as seen in the heatmap visualizations below.
To visualize the density of trips in each zone, an open source custom map of NYC zones (https://maps.princeton.edu/catalog/nyu-2451-36743) was used since the default map in Power BI does not support zone names in NYC.
When observing the density maps depicting trip counts, a clear distinction emerges between the pickup locations of yellow and green taxis. This disparity can be attributed to the fact that green taxis are restricted from picking up passengers in specific zones, such as Manhattan south of East 96th Street and south of West 110th Street, as well as at La Guardia and John F. Kennedy airports. Meanwhile, yellow taxis have the authorization to pick up passengers anywhere in New York City (https://www.nyc.gov/site/tlc/businesses/green-cab-markings.page).
Prominent zones for yellow taxis include Upper East Side North, Upper East Side South, and Midtown Manhattan, with notable pickups at JFK and LaGuardia airports. Midtown Manhattan holds a reputation as a bustling hub, drawing tourists to its iconic sites like the Empire State Building and Times Square. Similarly, the Upper East Side North is renowned for its affluent residents and upscale real estate offerings, including lavish apartments and penthouses (https://metropolismoving.com/blog/neighborhoods-in-manhattan-explained/). This suggests that yellow taxi riders are likely either local residents commuting between wealthy neighborhoods and nearby business districts, or tourists arriving from airports and visiting popular attractions in the vicinity.
The heatmap below reinforces this observation, with passengers from Upper East Side and Upper East Side South being most frequently picked up around 8am and 6pm, aligning with typical work hours. In contrast, the heatmap for JFK Airport, LaGuardia Airport, and Midtown Center shows a more dispersed pattern, potentially indicating that these pickups are more likely related to tourist activities.
The primary pickup zones for green taxis include East Harlem North and South, Central Harlem, and Astoria. Harlem, situated north of Central Park, is renowned for its cultural diversity and hosts various esteemed educational institutions such as Columbia University, The Apollo Theater, City College of NYC, and the Manhattan School of Music (https://metropolismoving.com/blog/neighborhoods-in-manhattan-explained/). It can be deduced that passengers being picked up in Harlem are likely residents commuting to work or school. This inference is reinforced by the heatmap, indicating peak activity during 8-9am and 5-6pm, corresponding to typical work and school hours.
Astoria, on the other hand is known for its vibrant nightlife (https://www.compass.com/neighborhood-guides/nyc/astoria/). From this zone, passengers are likely heading to nearby destinations for leisure and socializing. This deduction aligns with the heatmap, revealing heightened activity on Friday and Saturday nights.
Overall, the maps visualization of trip counts per zone along with the heatmap hourly and weekly patterns have provided valuable insights into passenger behavior and ride purposes based on pickup locations. This understanding can contribute to service optimization and predictive pickup strategies, enhancing the overall efficiency and effectiveness of the taxi service.
The average tip amount overall is approximately 1.55 USD. This average can serve as a benchmark for assessing the tipping behavior across different locations. Interestingly, pickups at Newark Airport exhibit the highest average tip, despite the relatively fewer number of trips. This suggests that passengers using taxi services from Newark Airport are more inclined to provide higher tips, possibly due to the convenience or quality of service. Meanwhile, green taxis in Queens receive a substantially lower average tip of 0.87 USD compared to yellow taxis with an average tip of 4.25 USD. This disparity could be attributed to various factors, such as passenger preferences, ride experiences, or cultural nuances.
The fact that fare amounts account for 80.92% of the total payment suggests that passengers allocate a significant portion of their payment to cover the base cost of the ride itself. This highlights the importance of fare amounts in passengers' budget considerations. Despite the fare amount being the major component, the 10.23% tip percentage signifies that passengers consider tipping to be an essential part of their transaction. This indicates a cultural or societal norm of recognizing and rewarding the service provided by taxi drivers.
By clearly communicating and itemizing the breakdown of fees, passengers gain a better understanding of their total payment, which fosters trust in the service provider. Simultaneously, taxi drivers can also appreciate the origin of these fees, enhancing their awareness and comprehension of the payment structure. This transparency serves as a foundation for a fair and respectful transaction process, contributing to a positive experience for all parties involved.